 noisy document



I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification

Neural Information Processing Systems

Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name.
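As a rough illustration of the unsupervised alternative mentioned above, the sketch below classifies an image embedding by cosine similarity to word embeddings of its candidate class names. The class names, the 300-dimensional random vectors, and the cosine helper are placeholders standing in for a real image encoder and pretrained word embeddings; this is not I2DFormer's image-to-document attention mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: in practice these would come from a pretrained
# word-embedding model (for the class names) and a trained image encoder.
class_names = ["zebra", "otter", "walrus"]
class_embeddings = {name: rng.normal(size=300) for name in class_names}
image_embedding = rng.normal(size=300)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero-shot prediction: pick the class whose name embedding is closest to the image embedding.
scores = {name: cosine(image_embedding, emb) for name, emb in class_embeddings.items()}
print(max(scores, key=scores.get), scores)
```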


MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation

Chang, Chia-Yuan, Jiang, Zhimeng, Rakesh, Vineeth, Pan, Menghai, Yeh, Chin-Chia Michael, Wang, Guanchu, Hu, Mingzhi, Xu, Zhichao, Zheng, Yan, Das, Mahashweta, Zou, Na

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are becoming essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. However, existing RAG systems frequently struggle with the quality of retrieved documents, as irrelevant or noisy documents degrade performance, increase computational overhead, and undermine response reliability. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents. Specifically, MAIN-RAG introduces an adaptive filtering mechanism that dynamically adjusts the relevance filtering threshold based on score distributions, effectively minimizing noise while maintaining high recall of relevant documents. The proposed approach leverages inter-agent consensus to ensure robust document selection without requiring additional training data or fine-tuning. Experimental results across four QA benchmarks demonstrate that MAIN-RAG consistently outperforms traditional RAG approaches, achieving a 2-11% improvement in answer accuracy while reducing the number of irrelevant retrieved documents. Quantitative analysis further reveals that our approach achieves superior response consistency and answer accuracy over baseline methods, offering a competitive and practical alternative to training-based solutions.
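A minimal sketch of this filtering idea, under stated assumptions: several judge agents score each retrieved document, scores are averaged as a simple stand-in for inter-agent consensus, and a threshold derived from the per-query score distribution (here, the mean of the aggregated scores) decides which documents to keep. The llm_agent_score function is a hypothetical placeholder for an LLM relevance judge, and the threshold rule is illustrative, not MAIN-RAG's exact design.

```python
from statistics import mean

def llm_agent_score(query: str, document: str) -> float:
    # Placeholder relevance score in [0, 1]; a real agent would be an LLM asked
    # to judge how well `document` supports answering `query`.
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def filter_documents(query, documents, n_agents=3):
    # In practice each agent would be a distinct LLM or prompt; here they share one scorer.
    aggregated = [mean(llm_agent_score(query, d) for _ in range(n_agents)) for d in documents]
    threshold = mean(aggregated)  # adaptive: depends on this query's score distribution
    return [d for d, s in zip(documents, aggregated) if s >= threshold]

docs = [
    "Paris is the capital of France.",
    "Bananas are rich in potassium.",
    "France's capital city is Paris, on the Seine.",
]
print(filter_documents("What is the capital of France?", docs))
```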


Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Yu, Sangwon, Kim, Ik-hwan, Song, Jongyoon, Lee, Saehyung, Park, Junsung, Yoon, Sungroh

arXiv.org Artificial Intelligence

Multi-hop reasoning, which requires multi-step reasoning based on the supporting documents within a given context, remains challenging for large language models (LLMs). LLMs often struggle to filter out irrelevant documents within the context, and their performance is sensitive to the position of supporting documents within that context. In this paper, we identify an additional challenge: LLMs' performance is also sensitive to the order in which the supporting documents are presented. We refer to this as the misordered context problem. To address this issue, we propose a simple yet effective method called context repetition (CoRe), which involves prompting the model by repeatedly presenting the context to ensure the supporting documents are presented in the optimal order for the model. Using CoRe, we improve the F1 score by up to 30%p on multi-hop QA tasks and increase accuracy by up to 70%p on a synthetic task. Additionally, CoRe helps mitigate the well-known "lost-in-the-middle" problem in LLMs and can be effectively combined with retrieval-based approaches utilizing Chain-of-Thought (CoT) reasoning.
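A minimal sketch of the context-repetition idea, assuming a simple prompt template: the retrieved documents are concatenated into one context block and that block is repeated before the question, so that for any pair of supporting documents some copy of the context presents them in the order the model needs. The build_core_prompt helper and its wording are illustrative assumptions, not the paper's exact prompt.

```python
def build_core_prompt(documents, question, repetitions=2):
    # Number each retrieved document and join them into one context block.
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    # Repeat the block so that, across copies, supporting documents also appear
    # in the reversed relative order (e.g. B before A as well as A before B).
    repeated = "\n\n".join([context] * repetitions)
    return f"{repeated}\n\nQuestion: {question}\nAnswer:"

docs = ["Lyon is a city in France.", "Alice was born in Lyon."]
print(build_core_prompt(docs, "In which country was Alice born?"))
```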


Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation

Choi, Juhwan, Yun, Jungmin, Jin, Kyohoon, Kim, YoungBin

arXiv.org Artificial Intelligence

The quality of the dataset is crucial for ensuring optimal performance and reliability of downstream task models. However, datasets often contain noisy data inadvertently included during the construction process. Numerous attempts have been made to correct this issue through human annotators. However, hiring and managing human annotators is expensive and time-consuming. As an alternative, recent studies are exploring the use of large language models (LLMs) for data annotation. In this study, we present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy. Specifically, we leverage approaches such as chain-of-thought (CoT) and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset, which is widely used for the multi-document summarization task. Through our proposed cleansing method, we introduce an enhanced Multi-News+. By employing LLMs for data cleansing, we demonstrate an efficient and effective approach to improving dataset quality without relying on expensive human annotation efforts.
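A minimal sketch of the majority-voting part of such a cleansing pipeline, under stated assumptions: each candidate document is judged related or unrelated several times and kept only on a majority of "related" votes. The llm_is_related function is a hypothetical stand-in for a chain-of-thought LLM call (whose sampled votes would actually vary); only the voting logic is meant to be illustrative.

```python
from collections import Counter

def llm_is_related(topic: str, document: str, seed: int) -> bool:
    # Placeholder: a real implementation would prompt an LLM with CoT reasoning,
    # and `seed`/temperature would make individual votes differ.
    return bool(set(topic.lower().split()) & set(document.lower().split()))

def keep_document(topic: str, document: str, n_votes: int = 5) -> bool:
    votes = Counter(llm_is_related(topic, document, seed) for seed in range(n_votes))
    return votes[True] > votes[False]  # simple majority vote

print(keep_document("earthquake in Chile", "A strong earthquake struck central Chile."))
print(keep_document("earthquake in Chile", "Stock markets rallied on tech earnings."))
```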


Benchmarking Large Language Models in Retrieval-Augmented Generation

Chen, Jiawei, Lin, Hongyu, Han, Xianpei, Sun, Le

arXiv.org Artificial Intelligence

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.
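As a rough illustration (not the official RGB harness), the sketch below shows how two of the abilities above might be scored: exact-match accuracy on answerable instances and a rejection rate on negative-rejection instances where the model should answer "unknown". The case format, the evaluate helper, and the toy model are assumptions for illustration.

```python
def exact_match(prediction: str, answers) -> bool:
    return prediction.strip().lower() in {a.strip().lower() for a in answers}

def evaluate(cases, model_fn):
    correct = rejections = negatives = 0
    for case in cases:
        pred = model_fn(case["question"], case["documents"])
        if case["answers"]:                      # answerable instance
            correct += exact_match(pred, case["answers"])
        else:                                    # negative-rejection instance
            negatives += 1
            rejections += pred.strip().lower() == "unknown"
    answerable = len(cases) - negatives
    return {"accuracy": correct / max(answerable, 1),
            "rejection_rate": rejections / max(negatives, 1)}

toy_cases = [
    {"question": "What is the capital of France?",
     "documents": ["Paris is the capital of France."], "answers": ["Paris"]},
    {"question": "What is the capital of Atlantis?",
     "documents": ["Unrelated text."], "answers": []},
]
print(evaluate(toy_cases, lambda q, docs: "Paris" if "France" in q else "unknown"))
```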


Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models

Yu, Wenhao, Zhang, Hongming, Pan, Xiaoman, Ma, Kaixin, Wang, Hongwei, Yu, Dong

arXiv.org Artificial Intelligence

Retrieval-augmented language models (RALMs) represent a substantial advancement in the capabilities of large language models, notably in reducing factual hallucination by leveraging external knowledge sources. However, the reliability of the retrieved information is not always guaranteed. The retrieval of irrelevant data can lead to misguided responses and may cause the model to overlook its inherent knowledge, even when it possesses adequate information to address the query. Moreover, standard RALMs often struggle to assess whether they possess adequate knowledge, both intrinsic and retrieved, to provide an accurate answer. In situations where knowledge is lacking, these systems should ideally respond with "unknown" when the answer is unattainable. In response to these challenges, we introduce Chain-of-Noting (CoN), a novel approach aimed at improving the robustness of RALMs when facing noisy, irrelevant documents and when handling unknown scenarios. The core idea of CoN is to generate sequential reading notes for retrieved documents, enabling a thorough evaluation of their relevance to the given question and integrating this information to formulate the final answer. We employed ChatGPT to create training data for CoN, which was subsequently used to train a LLaMA-2 7B model. Our experiments across four open-domain QA benchmarks show that RALMs equipped with CoN significantly outperform standard RALMs. Notably, CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope.
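A minimal sketch of note-then-answer prompting in the spirit of CoN, under stated assumptions: the model is asked to write a brief reading note per retrieved document before giving a final answer, and to reply "unknown" when neither the documents nor its own knowledge suffice. The build_con_prompt wording is an illustrative template, not the paper's exact prompt or its ChatGPT-generated training data.

```python
def build_con_prompt(question, documents):
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    return (
        "For each retrieved document below, write one brief reading note stating "
        "whether it is relevant to the question and what it contributes. Then give "
        "a final answer, or reply 'unknown' if the documents and your own knowledge "
        "are insufficient.\n\n"
        f"Documents:\n{numbered}\n\n"
        f"Question: {question}\n\n"
        "Notes and final answer:"
    )

docs = ["The Eiffel Tower was completed in 1889.", "Bananas are rich in potassium."]
prompt = build_con_prompt("When was the Eiffel Tower completed?", docs)
print(prompt)  # this prompt would then be sent to the RALM's underlying LLM
```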


Automatic Authorship Attribution of Noisy Documents

Sayoud, Halim (University of Sciences and Technology Houari Boumediene (USTHB)) | Khennouf, Salah (University of Sciences and Technology Houari Boumediene (USTHB)) | Benzerroug, Hocine (Independent Researcher) | Hamadache, Zohra (University of Sciences and Technology Houari Boumediene (USTHB)) | Hadjadj, Hassina (University of Sciences and Technology Houari Boumediene (USTHB)) | Ouamour, Siham (University of Sciences and Technology Houari Boumediene (USTHB))

AAAI Conferences

In this work, we investigate the robustness of several features and classifiers for automatic authorship attribution. Our corpus consists of 25 documents written in English by 5 American philosophers. The documents are converted into grey-scale images, and several levels of noise are added to corrupt these images. The noise is of the "Salt & Pepper" type, randomly added over the surface of the images at the following levels: 0%, 1%, 2%, 3%, 4%, 5%, 6% and 7%. Each image is then passed through an OCR (Optical Character Recognition) program to extract its text, and the resulting text documents are used in the authorship attribution experiments. Several features and classifiers are employed and evaluated with regard to classification performance. Results show that the most robust feature for authorship attribution is the character tetragram, which achieves a score of 100% even at a noise level of 7%.
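A minimal sketch of character-tetragram authorship attribution, the feature reported above as most robust to OCR noise: each text is reduced to its character 4-gram frequency profile and attributed to the author whose training profile is most similar under cosine similarity. The toy texts, the simulated OCR corruption, and the nearest-profile classifier are illustrative assumptions, not the paper's exact experimental setup.

```python
from collections import Counter
from math import sqrt

def tetragram_profile(text: str) -> Counter:
    text = " ".join(text.lower().split())  # light normalisation of case and whitespace
    return Counter(text[i:i + 4] for i in range(len(text) - 3))

def cosine(p: Counter, q: Counter) -> float:
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy "training" texts standing in for one clean document per author.
profiles = {
    "author_a": tetragram_profile("Liberty of thought is the parent of all progress in society."),
    "author_b": tetragram_profile("The machine computes the table of values without human help."),
}

# A test document with simulated OCR corruption (character substitutions).
noisy_test = "L1berty of th0ught is the parent of a11 progress."
test_profile = tetragram_profile(noisy_test)
print(max(profiles, key=lambda a: cosine(test_profile, profiles[a])))
```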